Univariate linear models are used to analyse variation in alpha diversity, but there are several ways to estimate alpha diversity. The following are options for calcuating various alpha diversity metrics from a species (columns)-sample (rows) matrix or dataframe using functions in the vegan package.
Here we use the phyloseq object called phy that we generated during the first chapter. This object has been manipulated to include only OTUs that have been classified as fungi and successfully classified at the phylum level, but no other manipulation has been performed.
# load library
library(vegan)
# make a four-panel plot
par(mfrow=c(2, 2))
# species richness
hist(specnumber(otu_table(phy)), main='Richness')
# Shannon diversity
hist(diversity(otu_table(phy)), main='Shannon diversity')
# Simpson diversity
hist(diversity(otu_table(phy), index='simpson'), main='Simpson diversity')
# Pielou's evenness (Shannon diversity / log richness)
hist(diversity(otu_table(phy)) / log(specnumber(otu_table(phy))),
main="Pielou's evenness")
See ?diversity for more information. Note that you can calculate indices on a transposed matrix, in which species are in rows and samples in columns, using the MARGIN = 2 argument.
Making comparisons among communities using these metrics is only sensible if the sampling effort was high enough to attain a high level of coverage (i.e., all species present within a community were likely to be observed during sampling) or communities were assessed with similar levels of sampling effort. If this is not the case, we may not have confidence that our richness estimates are meaningful. We can check the sampling effort associated with these samples by calculating the rowSums and plotting rarefaction curves.
# show variable sampling depth
summary(sample_sums(phy))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2613 37981 45280 40438 53276 61384
# plot species accumulation curves for each sample
# set step at a high number to speed up plotting
rarecurve(otu_table(phy), step=500)
# add lines indicating sampling depth
abline(v=sample_sums(phy), lty='dotted', lwd=0.5)
We can see five samples that all appear to be undersampled; note that these are the five ‘outlier’ samples in our network graph and are probably the same ordination. The sample-effort curves for the rest of the samples are flattening out, suggesting that they are adequately sampled. We can also estimate coverage to evaluate whether this observation is supported.
# load library
library(QsRutils)
# calculate Good's coverage (only uses singletons)
summary(goods(otu_table(phy)))
## no.sing no.seqs goods
## Min. : 26.00 Min. : 2613 Min. :98.12
## 1st Qu.: 48.50 1st Qu.:37981 1st Qu.:99.82
## Median : 60.50 Median :45280 Median :99.86
## Mean : 63.12 Mean :40438 Mean :99.66
## 3rd Qu.: 77.00 3rd Qu.:53276 3rd Qu.:99.89
## Max. :104.00 Max. :61384 Max. :99.94
Keep in mind, however, that singletons are likely to be underrepresented in these data since many of these would have been excluded during OTU clustering.
We can extrapolate the richness in each community based on the shape of species accumulation curves and perform analyses on these extrapolated estimates. We can use the estimate_richness function on our phyloseq object to calculate extrapolated species richness, here using the Chao1 index.
We will compare the relationship between observed and extrapolated richness using the plot_ly function from the plotly library; this function provides some nice functionality in that we can add layers to the plot for obtaining more information regarding individual data points.
# load library
library(plotly)
# calculate extrapolated species richness
rich <- estimate_richness(phy, measures=c('Observed', 'Chao1'))
plot_ly(rich, x=~Observed, y=~Chao1) %>%
add_markers(text=rownames(rich))
Let’s check whether extrapolated species richness appears to be sensitive to sequencing depth.
rich <- cbind(rich, readcounts=sample_sums(phy))
plot_ly(rich, x=~readcounts, y=~Chao1) %>%
add_markers(text=rownames(rich))
Those same five samples have much lower extrapolated richness than the other samples. We cannot be certain whether richness within these samples is being estimated accurately. Therefore, we will delete these samples from the data.
If the error estimates assocated with extrapolated richness are too high, we might calculate richness from our community matrix based on a constant effort across all samples using the rarefy_even_depth function. This resamples the data randomly to obtain an OTU matrix with equal sampling depth for each sample (default is resampling with replacement). Setting a threshold that is above the sequencing depth for those five samples results in them being removed from the data. Note the use of the rngseed argument so that the result is the same each time the script is run.
# identify the sampling depth for the samples to keep
head(sort(sample_sums(phy)), 10)
## JP31 JP12 JP1 JP9 JP19 JP32 JP10 JP15 JP7 JP25
## 2613 3541 3868 4314 8277 26264 33083 33132 39597 39685
# use the 'rarefy_even_depth' function to resample reads
# set 'rngseed' so that outcome is reproducible
# save the output in a new object
phy.rar <- rarefy_even_depth(phy, 26264, rngseed=101)
Alternatively, if the only thing that we want to do is remove the samples without doing any resampling, we can do this with the following code:
# keep only samples with at least 26264 reads
phy <- subset_samples(phy, sample_sums(phy) >= 26264)
Finally, let’s extract the sample data from the phyloseq object and store as an object in the workspace, here named dat. We will add our diversity estimates to it and then use this dataframe for subsequent analyses.
# extract sample data and change to data.frame class
# note that 'as.data.frame' leaves as 'sample_data' object,
# which does not work with downstream functions
dat <- data.frame(sample_data(phy))
# split 'Treatment' into separate variables for 'N' and 'P'
dat$N <- factor(with(dat, ifelse(Treatment %in% c('N', 'NP'), 'yesN', 'noN')))
dat$P <- factor(with(dat, ifelse(Treatment %in% c('P', 'NP'), 'yesP', 'noP')))
# species richness
dat$richness <- specnumber(otu_table(phy))
# extrapolate richness
dat$Chao1 <- estimate_richness(phy, measures='Chao1')$Chao1
# Simpson diversity
dat$simpson <- diversity(otu_table(phy), index='simpson')
# Pielou's evenness (Shannon diversity / log richness)
dat$pielou <- diversity(otu_table(phy)) / log(specnumber(otu_table(phy)))
Now we can analyse the data.
First let’s produce a bargraph to check whether richness is influenced by sample type and fertiliser treatment.
# plot the data
ggplot(dat, aes(x=SampleType, y=Chao1, fill=Treatment)) +
# produce bars showing mean for each group
stat_summary(geom="bar", fun.y=mean, position=position_dodge(width=0.9)) +
# produce error bars showing standard error for each group
# set 'mult=2' for 95% confidence interval
stat_summary(geom="errorbar", fun.data=mean_se, fun.args=list(mult=1),
position=position_dodge(width=0.9), width=0.25) +
# change the axis labels
ylab('OTU count') +
scale_x_discrete(name='Sample Type', labels=c('Root samples', 'Soil samples')) +
# customise theme
theme_bw() +
# change the legend title, labels and position
scale_fill_manual(name='Fertiliser treatment',
labels=c('-', 'N', 'P', 'NP'),
values=c('grey', 'yellow', 'purple', 'green')) +
theme(legend.position='top')
Note: be careful when customising labels, these need to be specified in the order associated with the levels of each factor:
levels(dat$Treatment)
## [1] "control" "N" "P" "NP"
levels(dat$SampleType)
## [1] "root" "soil"
Check out the RColorBrewer library for more appealing colour schemes.
Then let’s fit a model and inspect the results.
# fit the model and store result in an object
# 'a * b' is the same as 'a + b + a:b'
m1.rich <- lm(Chao1 ~ SampleType * N * P, data=dat)
# check diagnostic plots
par(mfrow=c(1, 2))
qqPlot(m1.rich)
## JP2 JP21
## 1 17
residualPlot(m1.rich)
There might be some bias, with more variation with smaller means. If we saw the opposite, we might try a data transformation (e.g., sqrt(richness)) or fitting a generalised linear model (e.g., glm(lm(richness ~ SampleType * Treatment, data=dat, family=poisson))). What we have observed is tricky to address though. Let’s ignore it for now.
Let’s look at the significance of our predictors and visualise the model predictions.
# load libraries
library(car)
library(visreg)
library(emmeans)
# Anova table
Anova(m1.rich)
## Anova Table (Type II tests)
##
## Response: Chao1
## Sum Sq Df F value Pr(>F)
## SampleType 29658 1 5.6070 0.02865 *
## N 8241 1 1.5581 0.22711
## P 3078 1 0.5819 0.45494
## SampleType:N 8207 1 1.5515 0.22805
## SampleType:P 20262 1 3.8307 0.06518 .
## N:P 44 1 0.0083 0.92815
## SampleType:N:P 3201 1 0.6052 0.44617
## Residuals 100500 19
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# plot model predictions
visreg(m1.rich, xvar='P', by='SampleType', overlay=T)
visreg(m1.rich, xvar='N', by='SampleType', overlay=T)
One thing that we haven’t accounted for is the nested experimental design. The fertiliser treatments are applied in a randomised complete block design (therefore nested within BlockID) and we have taken multiple samples from each plot (therefore nested within PlotID). We may expect that these grouping variables might explain some of the variation in our data if other drivers of diversity covary in space.
# load the library that contains 'spread'
library(tidyr)
# split the 'Chao1' column into two based on 'SampleType'
# select the columns that we are interested in for this to work
temp <- spread(select(dat, PlotID, BlockID, SampleType, Chao1),
key='SampleType', value='Chao1')
# look at first few rows to see what was done
head(temp)
## PlotID BlockID root soil
## 1 3 1 497.6379 474.8095
## 2 4 1 315.1081 509.4054
## 3 5 1 360.8571 418.1200
## 4 8 1 NA 536.6176
## 5 16 2 455.9000 421.3913
## 6 18 2 376.0000 NA
# plot the result
plot_ly(temp, x=~root, y=~soil, color=~BlockID) %>%
add_markers(text=~PlotID)
## Warning: Ignoring 5 observations
# tidy the workspace
rm(temp)
There doesn’t appear to be much pattern when it comes to BlockID but there may be a positive correlation based on samples taken from the same plot. To test this, we can fit a mixed model using functions in the lme4 library and inspect the results.
# load library
library(lme4)
# fit the model and store result in an object
m2.rich <- lmer(Chao1 ~ SampleType * N * P + (1|BlockID/PlotID), data=dat)
## boundary (singular) fit: see ?isSingular
# look at random effect block
VarCorr(m2.rich)
## Groups Name Std.Dev.
## PlotID:BlockID (Intercept) 51.6327441
## BlockID (Intercept) 0.0010101
## Residual 52.2942366
# check diagnostic plots (note differences from 'lm' object)
qqPlot(resid(m2.rich))
## JP15 JP21
## 12 17
plot(m2.rich)
# view ANOVA table (extra argument to calculate Kenward-Rogers DF)
Anova(m2.rich, test='F')
## Analysis of Deviance Table (Type II Wald F tests with Kenward-Roger df)
##
## Response: Chao1
## F Df Df.res Pr(>F)
## SampleType 6.9176 1 8.2985 0.02925 *
## N 0.6994 1 10.2037 0.42215
## P 0.8548 1 8.7225 0.38008
## SampleType:N 2.8615 1 8.8767 0.12545
## SampleType:P 7.3355 1 8.8273 0.02447 *
## N:P 0.0046 1 10.7330 0.94727
## SampleType:N:P 0.1439 1 8.8580 0.71333
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# 'P' has a significant interactions with 'SampleType'
# calculate marginal means to interpret interactions
m2.emm.P <- emmeans(m2.rich, ~ P | SampleType)
## NOTE: Results may be misleading due to involvement in interactions
pairs(m2.emm.P)
## SampleType = root:
## contrast estimate SE df t.ratio p.value
## noP - yesP 83.7 39.8 14.0 2.103 0.0540
##
## SampleType = soil:
## contrast estimate SE df t.ratio p.value
## noP - yesP -38.6 42.3 14.8 -0.913 0.3757
##
## Results are averaged over the levels of: N
Based on this model, adding phosphorus increased richness for root samples but not soil samples. The variance estimates confirmed our observation that PlotID but not BlockID explained variation in the response
Including continuous variables as predictors greatly changes the way in which we produce plots, and to a certain extent changes the way in which we fit and interpret the results of models.
For this example, we look at the relationship between Chao richness and available soil phosphorus for each sample type.
# make a plot showing how 'Chao1' changes with
# 'Extractable_P_Soil' for the two sample types
ggplot(dat, aes(x=Extractable_P_Soil, y=Chao1,
color=SampleType)) +
# add a line and ribbon showing the fitted model and confidence interval
stat_smooth(method='lm', aes(fill=SampleType), show.legend=F) +
# add points representing the raw data
geom_point() +
# modify axis labels
xlab('Available Phosphorus (ppt)') + ylab('OTU richness (Chao1)') +
# change colours and modify legend
scale_fill_manual(values=c('green', 'brown')) +
scale_color_manual(name='Sample Type',
labels=c('Root samples', 'Soil samples'),
values=c('green', 'brown')) +
# customise the theme
theme_bw()
There doesn’t appear to be much of a relationship with available phosphorus, but let’s go through the steps to make sure.
# fit the model and store result in an object
m3.rich <- lmer(Chao1 ~ SampleType * Extractable_P_Soil + (1|PlotID), data=dat)
# check diagnostic plots
qqPlot(resid(m3.rich))
## JP2 JP21
## 1 17
plot(m3.rich)
# view ANOVA table (extra argument to calculate Kenward-Rogers DF)
Anova(m3.rich, test='F')
## Analysis of Deviance Table (Type II Wald F tests with Kenward-Roger df)
##
## Response: Chao1
## F Df Df.res Pr(>F)
## SampleType 5.1438 1 11.985 0.04262 *
## Extractable_P_Soil 0.0512 1 14.311 0.82418
## SampleType:Extractable_P_Soil 0.7823 1 13.581 0.39184
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# check model summary to interpret the results further
summary(m3.rich)
## Linear mixed model fit by REML ['lmerMod']
## Formula: Chao1 ~ SampleType * Extractable_P_Soil + (1 | PlotID)
## Data: dat
##
## REML criterion at convergence: 261.5
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -2.24409 -0.49330 -0.01384 0.57359 1.98640
##
## Random effects:
## Groups Name Variance Std.Dev.
## PlotID (Intercept) 826.6 28.75
## Residual 5051.8 71.08
## Number of obs: 27, groups: PlotID, 16
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) 395.12 28.32 13.953
## SampleTypesoil 91.83 40.62 2.261
## Extractable_P_Soil 312.42 651.47 0.480
## SampleTypesoil:Extractable_P_Soil -767.09 843.14 -0.910
##
## Correlation of Fixed Effects:
## (Intr) SmplTy Ex_P_S
## SampleTypsl -0.619
## Extrctb_P_S -0.691 0.437
## SmplT:E_P_S 0.480 -0.724 -0.700
# 'emtrends' may be more intuitive
emtrends(m3.rich, 'SampleType', 'Extractable_P_Soil')
## SampleType Extractable_P_Soil.trend SE df lower.CL upper.CL
## root 312 671 23.0 -1075 1700
## soil -455 616 22.9 -1728 819
##
## Degrees-of-freedom method: kenward-roger
## Confidence level used: 0.95